Goto

Collaborating Authors

 linguistic structure


Large language models transition from integrating across position-yoked, exponential windows to structure-yoked, power-law windows

Neural Information Processing Systems

Modern language models excel at integrating across long temporal scales needed to encode linguistic meaning and show non-trivial similarities to biological neural systems. Prior work suggests that human brain responses to language exhibit hierarchically organized "integration windows" that substantially constrain the overall influence of an input token (e.g., a word) on the neural response. However, little prior work has attempted to use integration windows to characterize computations in large language models (LLMs). We developed a simple word-swap procedure for estimating integration windows from black-box language models that does not depend on access to gradients or knowledge of the model architecture (e.g., attention weights). Using this method, we show that trained LLMs exhibit stereotyped integration windows that are well-fit by a convex combination of an exponential and a power-law function, with a partial transition from exponential to power-law dynamics across network layers. We then introduce a metric for quantifying the extent to which these integration windows vary with structural boundaries (e.g., sentence boundaries), and using this metric, we show that integration windows become increasingly yoked to structure at later network layers. None of these findings were observed in an untrained model, which as expected integrated uniformly across its input. These results suggest that LLMs learn to integrate information in natural language using a stereotyped pattern: integrating across position-yoked, exponential windows at early layers, followed by structure-yoked, power-law windows at later layers. The methods we describe in this paper provide a general-purpose toolkit for understanding temporal integration in language models, facilitating cross-disciplinary research at the intersection of biological and artificial intelligence.





Learning How Learning Works

Communications of the ACM

In 2023, Noam Chomsky, considered the founder of modern linguistics, wrote that LLMs "learn humanly possible and humanly impossible languages with equal facility." However, in the Mission: Impossible Language Models paper that received a Best Paper award at the 2024 Association of Computational Linguistics (ACL) conference, researchers shared the results of their testing of Chomsky's theory, having discovered that language models actually struggle with learning languages with non-standard characters. Rogers Jeffrey Leo John, CTO of DataChat Inc., a company that he cofounded while working at the University of Wisconsin as a data science researcher, said the Mission: Impossible paper challenged the idea that LLMs can learn impossible languages as effectively as natural ones. "The models [studied for the paper] exhibited clear difficulties in acquiring and processing languages that deviate significantly from natural linguistic structures," said John. "Further, the researchers' findings support the idea that certain linguistic structures are universally preferred or more learnable both by humans and machines, highlighting the importance of natural language patterns in model training. This finding could also explain why LLMs, and even humans, can grasp certain languages easily and not others."


Random Text, Zipf's Law, Critical Length,and Implications for Large Language Models

arXiv.org Machine Learning

We study a deliberately simple, fully non-linguistic model of text: a sequence of independent draws from a finite alphabet of letters plus a single space symbol. A word is defined as a maximal block of non-space symbols. Within this symbol-level framework, which assumes no morphology, syntax, or semantics, we derive several structural results. First, word lengths follow a geometric distribution governed solely by the probability of the space symbol. Second, the expected number of words of a given length, and the expected number of distinct words of that length, admit closed-form expressions based on a coupon-collector argument. This yields a critical word length k* at which word types transition from appearing many times on average to appearing at most once. Third, combining the exponential growth of the number of possible strings of length k with the exponential decay of the probability of each string, we obtain a Zipf-type rank-frequency law p(r) proportional to r^{-alpha}, with an exponent determined explicitly by the alphabet size and the space probability. Our contribution is twofold. Mathematically, we give a unified derivation linking word lengths, vocabulary growth, critical length, and rank-frequency structure in a single explicit model. Conceptually, we argue that this provides a structurally grounded null model for both natural-language word statistics and token statistics in large language models. The results show that Zipf-like patterns can arise purely from combinatorics and segmentation, without optimization principles or linguistic organization, and help clarify which phenomena require deeper explanation beyond random-text structure.


Beyond Individuals: Collective Predictive Coding for Memory, Attention, and the Emergence of Language

arXiv.org Artificial Intelligence

This commentary extends the discussion by Parr et al. on memory and attention beyond individual cognitive systems. From the perspective of the Collective Predictive Coding (CPC) hypothesis -- a framework for understanding these faculties and the emergence of language at the group level -- we introduce a hypothetical idea: that language, with its embedded distributional semantics, serves as a collectively formed external representation. CPC generalises the concepts of individual memory and attention to the collective level. This offers a new perspective on how shared linguistic structures, which may embrace collective world models learned through next-word prediction, emerge from and shape group-level cognition.


Tokens with Meaning: A Hybrid Tokenization Approach for NLP

arXiv.org Artificial Intelligence

Tokenization plays a pivotal role in natural language processing (NLP), shaping how text is segmented and interpreted by language models. While subword methods such as Byte Pair Encoding (BPE) and WordPiece have been effective, they often struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure. We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency. It assigns shared identifiers to phonologically variant affixes (e.g., -ler and -lar) and altered root forms (e.g., kitap vs. kitabฤฑ), reducing redundancy while maintaining semantic integrity. Special tokens are added for whitespace and case, including an UPPERCASE marker to avoid vocabulary inflation from capitalization. BPE is integrated for out-of-vocabulary coverage without harming morphological coherence. On the TR-MMLU benchmark, the tokenizer achieves the highest Turkish Token Percentage (90.29\%) and Pure Token Percentage (85.8\%). Comparisons with tokenizers from LLaMA, Gemma, and GPT show more linguistically meaningful and coherent tokens. Although demonstrated on Turkish, the approach is language-independent and adaptable to other languages, offering a practical path toward more interpretable and effective multilingual NLP systems.


Linguistic Blind Spots of Large Language Models

arXiv.org Artificial Intelligence

Large language models (LLMs) are the foundation of many AI applications today. However, despite their remarkable proficiency in generating coherent text, questions linger regarding their ability to perform fine-grained linguistic annotation tasks, such as detecting nouns or verbs, or identifying more complex syntactic structures like clauses in input texts. These tasks require precise syntactic and semantic understanding of input text, and when LLMs underperform on specific linguistic structures, it raises concerns about their reliability for detailed linguistic analysis and whether their (even correct) outputs truly reflect an understanding of the inputs. In this paper, we empirically study the performance of recent LLMs on fine-grained linguistic annotation tasks. Through a series of experiments, we find that recent LLMs show limited efficacy in addressing linguistic queries and often struggle with linguistically complex inputs. We show that the most capable LLM (Llama3-70b) makes notable errors in detecting linguistic structures, such as misidentifying embedded clauses, failing to recognize verb phrases, and confusing complex nominals with clauses. Our results provide insights to inform future advancements in LLM design and development.


Fundamental Principles of Linguistic Structure are Not Represented by o3

arXiv.org Artificial Intelligence

Instead of scaling to unprecendented levels of compute via architectures that are fundamentally grounded in token prediction, a return to more traditional design features of the human mind (predicate-argument structure, variable binding, constituent structure, minimal compositional binding; Donatelli & Koller 2023) may be needed to orchestrate a more reliable expertise in human language (Ramchand 2024). This could be implemented by forms of neuro-symbolic approaches. Still, it is also certainly true that mainstream theoretical linguistics (e.g., the minimalist enterprise) was in some ways ill-equipped to successfully predict which patterns of linguistic activity might be (un)approachable by LLMs. To illustrate, a potential weakness in this direction with respect to recent generative grammar theorizing has been the underestimation of the extent to which lexical information drives composition. This type of information may permit LLMs to abductively infer certain elements of grammatical rules, in whatever format this ultimately takes (Ramchand 2024). Future research should more carefully apply the tools of linguistics to isolate specific sub-components of syntax that might be in principle achievable by language models, given specific design features. For instance, with LLMs "complete recovery of syntax might be very di`icult computationally" (Marcolli et al. 2025: 13), even if we assume that attention modules can in principle "satisfy the same algebraic structure" as what Marcolli et al. postulate as being necessary for syntaxsemantics interface mappings.